[  HD-A132  828 ON  THE  USE  OF  MARGINAL  LIKELIHOOD  IN  TIME  SERIES  MODEL  1/1 
ESTIMATIONS)  WISCONSIN  UNI V-MADISON  MATHEMATICS 
RESEARCH  CENTER  G  TUNNICLIFFE-WILSON  JUL  83 
UNCLASSIFIED  MRC-TSR-2529  DAAG29-80-C-O041  F/G  12/1  NL 


Mathematics  Research  Center 
University  of  Wisconsin— Madison 
610  Walnut  Street 
Madison,  Wisconsin  53705 


July  1983 


(Received  June  3,  1983) 


OTIC  FILE  COPT 


Approved  (or  public  release 
Distribution  unlimited 


Sponsored  by 
» 

U.  S.  Army  Research  Office 
P.  O.  Box  12211 
Research  Triangle  Park 
North  Carolina  27709 


S3  09  22  127 


UNIVERSITY  OF  WISCONSIN  -  MADISON 
MATHEMATICS  RESEARCH  CENTER 


4  ON  THE  USE  OF  MARGINAL  LIKELIHOOD  IN  TIME  SERIES 

MODEL  ESTIMATION 

,  G.  Tunniciiffe-Wilson 

Technical  Summary  Report  #2539 
July  1983 

ABSTRACT 

This  paper  is  concerned  with  the  estimation  of  regression  models  with 
errors  which  follow  an  Autoregressive  Integrated  Moving  Average  (ARIMA) 
process.  The  effect  of  the  regression  upon  the  ARIMA  model  parameter 
estimates  Is  considered  and  marginal  likelihood  investigated  as  a  means  of 
overcoming  some  small  sample  bias.  Examples  illustrate  the  importance  of  this 
effect  even  in  samples  of  moderate  size.  The  consequences  regarding  inference 
for  the  regression  coefficients  are  also  discussed. 
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SIGNIFICANCE  AND  EXPLANATION 


t 


» 


When  simple  linear  regression  is  carried  out  to  relate  observations  of 
one  dependent  variable,  to  observations  of  one  or  more  independent  variables, 
it  is  important  to  allow  for  any  serial  correlation  in  the  errors.  This  helps 
to  avoid  nonsense  relationships  being  established,  by  giving  more  realistic 
values  for  the  precision  of  the  estimated  coefficients.  Unfortunately  the 
regression  tends  to  distort  the  errors  so  the  evidence  of  serial  correlation 
may  be  lost.  This  paper  proposes  that  modeling  of  the  error  correlation 
should  be  based  only  on  that  information  in  the  observations  Which  cannot  be 
distorted  by  the  regression.  Thus  it  uses  what  is  termed  the  marginal  like¬ 
lihood  criterion.  Several  examples  are  used  to  illustrate  how  this  can 
improve  inference,  particularly  concerning  the  question  of  whether  random  walk 
components  are  present  in  the  error.  Applications  include  models  of  economic 
series,  and  scientific  series  such  as  the  periods  of  peak  sunspot  activity. 
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ON  THE  USE  OP  MARGINAL  LIKELIHOOD  IN  TIME  SERIES 
MODEL  ESTIMATION 

G.  Tunnicliff e-Wilson 


1.  INTRODUCTION 

We  shall  consider  the  regression  model  y  ■  xa  +  e,  or  in  full: 

y.  -  ax.  +•••+  a„x_  +  e  ,  t  *  1  •••  n  (1.1) 

b  1  1  f  t  K  t  b 

where  the  observations  y  are  dependent  upon  fixed  regressors  x  which  in 
many  cases  will  be  deterministic  functions  of  time  t. 

It  is  well  known  that  estimation  of  the  coefficients  a  should  take  into 
account  the  covariance  structure  of  the  errors  e. 

Our  interest  is  in  situations  where  e  is  modelled  as  either  a 
stationary  process  with  continuous  spectrum,  or  as  an  integrated  stationary 
process,  using  the  ARIMA  (p,d,q)  models  presented  in  Box  and  Jenkins  (1976). 


Thus  typically  in  the  stationary  case. 


+•••+  ♦  e_  +  a  -  0,a.  .  0  a 

p  t-p  t  1  t-1  q 


q  t-q 


(1.2) 


or  +(B)e^  "  0( B)afc  where  afc  are  assumed  to  be  independent  Normal  (0,o  ). 
Using  the  representation  *(B)et  "  afc ,  where  x(B)  ■  +(B)/0(B),  and  assuming 
the  parameters  0  *  ($,0)  to  be  known,  efficient  estimates  of  the  regression 


coefficients  a  can  be  found  to  a  first  approximation  (i.e.  neglecting  end 


effects)  by  applying  ordinary  least  squares  (OLS)  to  y'  *  x'a  +  a,  where 


y;  -  w(B)yt  and  x'^  -  w(B)xfc  t> 

In  theory,  provided  efc  is  stationary,  application  of  OLS  directly  to 
y  and  x  provides  consistent  estimates  of  a.  If  further  the  variance  of 


x^  t  is  concentrated  in  a  relatively  narrow  frequency  band,  non-overlapping 
for  each  k,  the  actual  estimates  may  not  be  very  sensitive  to  the 


operator  or  filter  *(B)  that  is  chosen,  whether  or  not  it  corresponds  to  the 


true  error  structure.  In  other  cases,  the  estimates  of  a  may  depend 
markedly  on  the  choice  of  *(B).  In  all  cases,  the  standard  error  estimates 

A 

for  a  will  be  sensitive  to  the  error  model  (as  represented  by  w(B) )  and 
also.  It  follows,  will  be  any  assessment  of  significance.  The  effect  can  be 
quite  dramatic,  for  instance  when  the  alternatives  are  and 

Ve^  “  afc,  corresponding  to  OLS  applied  to  y  and  x  in  the  first  case,  and 
between  Vy  and  Vx  in  the  second  -  see  Box  and  Newbold  (1971).  If  correct 
inference  about  a  is  to  be  made,  the  error  model  for  e  should  therefore  be 
chosen  with  care  and  its  parameters  a  estimated  efficiently  -  even  though 
they  My  be  regarded  as  nuisance  parameters  with  respect  to  the  regression. 

We  shall  therefore  tend  to  concentrate  in  this  paper  on  estiMtion  of  the 
time  series  model  for  the  error  process.  We  shall  be  particularly  concerned 
with  investigating  the  presence  of  a  random  walk  component  in  the  error, 
because  this  can  have  a  substantial  influence  on  the  estimated  precision  of 
the  regression.  In  section  2  we  indicate  how  the  presence  of  regression  terms 
in  the  model  (1.1)  can  distort  the  error  structure  and  lead  to  incorrect  con¬ 
clusions  if  least  squares  criteria  are  used  without  caution.  Sections  3  and  4 
introduce  marginal  likelihood  as  an  estimation  criterion  which  is  capable  of 
overcoming  this  distortion  by  treating  the  regression  coefficients  as  nuisance 
parameters.  Section  5  considers  the  particular  cases  of  estimating  first 
order  autoregressive  and  first  order  moving  average  errors  when  the  regression 
is  simply  a  model  constant  term.  Section  6  considers  estimation  of 
seasonality,  section  7  tests  of  autoregressive  roots  of  unity,  section  8 
Intervention  modelling,  and  section  9  the  estimation  of  sinusoidal  regression 
components,  i.e.  a  mixed  spectrum. 


2.  DISTORTION  OP  RESIDUALS  BY  REGRESSION . 

A  reasonable  procedure  for  investigating  the  error  structure  is  first  to 
carry  out  OLS  regression  of  y  on  x,  or  of  Vy  on  Vx  if  differencing  is 
thought  to  be  appropriate.  The  residual  series  is  then  taken  as  an  estimate 
of  et  or  Ve^,  and  subjected  to  various  tests  for  autocorrelation,  and 
possibly  the  full  ARINA  model  identification  process  as  presented  by  Box  and 
Jenkins  (1976). 

The  possible  distorting  effect  of  the  regression  upon  the  autocorrelation 
has  long  been  appreciated.  Thus  when  the  regression  corresponds  simply  to 
mean  correction  of  y,  it  is  pointed  out  by  Kendall  and  Stewart  (1968)  p.  435 
that  'there  is  a  downward  bias  in  r j,  even  for  a  random  series',  and  when 
the  true  error  structure  is  AR(1),  that  'the  bias  in  all  these  cases  is  down¬ 
wards  and  obviously  may  be  quite  serious* .  Correlograms  which  should  decay 
(in  the  mean)  towards  zero  in  a  geometric  manner,  tend  to  swing  to  negative 
values  over  a  range  of  lags.  The  nature  of  the  bias  may  be  seen  in  the  way 
the  residual  spectrum  is  affected.  Mean  correction  removes  power  at  frequency 
»  ■  0  which  may  have  a  large  effect  on  AR(1)  models  with  positive  auto¬ 
correlation  at  lag  1,  and  on  MA(1)  models  with  negative  autocorrelation,  since 
much  of  the  information  about  the  parameters  in  these  models  occurs  at  low 
frequencies.  Regression  upon  sinusoidal  functions,  and  similarly  upon 
indicator  variables  for  seasonality,  distorts  the  residual  spectrum  at  other 
frequencies,  and  affects  inference  about  the  error  structure  as  we  see  later. 

Durbin  and  Watson  (1950,  1951)  focused  attention  upon  the  effects  of 
regression  whan  devising  their  test  for  autocorrelated  error.  Hie  vast  amount 
of  subsequent  work  in  this  area  is  the  subject  of  a  valuable  review  by  King 
(1983).  He  reports  that  likelihood  ratio  (LR)  tests  have  relatively  poor 
power  against  an  AR(1)  error  structure  with  positive  autocorrelation  when  the 


regressors  'are  smoothly  evolving' ,  and  blames  small  sample  bias  of  the  AR 
parameter*  Be  cites  other  authors  who  find  the  LR  test  unreliable. 

Perhaps  the  simplest  demonstration  of  how  inference  may  be  affected  is  to 
consider  the  case  of  level  and  trend  estimation  under  the  alternatives  that 
the  errors  are  random,  or  perform  a  random  walk.  Let  us  take 
Data  y1...yn 


Model  1 


c  ♦  bt  +  e* 


•t  "  at 


Model  2 


yt  *  c  +  bt  +  et  t  Vefc 


It  will  be  useful  later  if  we  note  here  that  under  model  1  the  OLS  estimates 


from  solving 


n(n+1  )/2 


n( n+1 )/2  n(n+1)(2n+1)/6l  b 


(2.3) 


Ohder  model  2  we  need  a  constraint  such  as  e,  *  0  to  estimate 

A  A  A 

Cj  ■  y^  -  b2,  hut  OLS  can  be  applied  to  7yfc  for  b2 

A 

E7yt  “  yn  ’  y1  “  <n"1)b2  *  <2.4) 

The  residual  mean  squares  of  these  regressions  nay  be  calculated  as  RMS^  - 

-1  *  *  2  -1  *  2 
(n-2)  £<yt  -  c^  -  b^t)  ,  FMS2  -  (n-2)  KVy^  -  bj)  ,  the  residual  degrees  of 

freedom  being  (n-2)  in  both  cases. 

It  would  not  be  unnatural  to  select  between  these  two  models  on  the  basis 
of  the  RMS  values.  It  is  however  possible  to  evaluate  their  expected  values 
under  the  two  models!  as  shown  in  Table  1. 


True  model 


(n+2)o  /1 5 


2no  /(n-1 ) 


!<v<r 


Thus  although  for  large  samples/  selection  on  the  basis  of  the  smallest  RMS 
value  would  lead  to  acceptance  of  the  correct  model/  we  note  that  for  n  <  13 
this  criterion  is  biased  in  favour  of  model  1  when  in  fact  model  2  is  correct* 
In  other  words  an  ordinary  linear  regression  fits  a  random  walk  better  than  a 
random  walk  model  does)  The  bias  is  particularly  galling  in  the  case  n  *  3 
when  in  fact  RMS 2  ”  3  RMS 1 ,  i.e.  they  are  proportional  statistics/  and  in 
fact  there  is  no  basis  for  discrimination. 

The  effect  is  no  doubt  worse  when  polynomials  of  higher  degree  are 
fitted,  and  may  explain  the  attraction  of  prediction  methods  which  fit/  say, 
cubics  to  the  latest  segment  of  a  series.  Although  for  a  random  walk  such 
fits  will  be  'good',  the  model  is  quite  wrong  and  gives  poor  predictions. 

The  ratio  RMS ^ /RMS 2  does  in  fact  have  excellent  properties  for  model 
discrimination  provided  that  one  does  not  simply  examine  whether  it  is  greater 
or  less  than  one,  but  Instead  takes  the  trouble  to  investigate  its 
distribution.  This  is  appreciated  for  example  by  Dicke  (1978),  who  fits  a 
trend  line  to  the  sequence  of  times  of  peak  sunspot  activity  shown  in  Figure 
1,  and  compares  the  observed  RMS  ratio  which  is  .87,  with  the  ratio  of 
expected  RMS  values  which  is  .46  under  model  1  and  1.72  under  model  2.  In 

A 

fact  he  uses  the  estimate  b^  in  both  RMS  quantities  but  this  has  relatively 
little  effect  on  the  formula.  He  takes  the  observed  value  of  .87  as  being 
more  compatible  with  model  1.  I  am  however  grateful  to  Maxwell  King  for 
supplying  me  with  the  1%  critical  value  of  .83  for  the  RMS  ratio  under  model 
1,  and  it  would  appear  that  the  observations  are  compatible  with  neither  model 
1  nor  model  2.  The  situation  is  complicated  by  the  introduction  of  another 
regression  variable,  the  peak  sunspot  number  associated  with  each  time 
point.  We  later  reconsider  this  data. 


•If 


INDEX  NUMBER  OF  PEAK 


Figure  1:  Peak  years  of  Sunspot  numbers. 


The  same  problem,  of  testing  for  residuals  which  follow  a  random  walk,  is 
considered  by  Sargan  and  Bhargava  (1983).  Although  exact  tests  against  a 
specific  autocorrelated  error  structure  are  valuable,  the  aim  of  this  paper  is 
to  stress  model  estimation  using  a  criterion  which  might  be  relied  upon  in 
general  situations  to  indicate  the  true  model. 


3.  ESTIMATION  CRITERIA 

He  start  with  the  distribution  of  the  data  for  the  model  (1.1)  and  this 


is  given  formally  by  the  density 


f(yl<*tfl»0)  *  <TnM-1/2exp{-l/2  Q/o2) 


(3.1) 


where  Q  *  e'V*e  is  a  quadratic  form  in  the  errors  e  “  y-xa,  and  the 

2  2 

matrix  V  has  elements  V,  ■  cov(e.,e  J/o  •  y .  Va  which  are  functions  of 


0  only,  being  the  theoretical  autocovariance  sequence  of  the  error 

model.  Also  N  *  det  V. 

In  practice  Q  is  calculated  without  explicitly  constructing  V  and  is 

2 

usually  dominated  by  a  sum  of  squares  term  Eafc  where  at  is  regenerated 
from  e^.  B.g.  for  the  ABMA(1,1)  model,  using  the  fact  that  (a^  -  e^)  is 
independent  of  a^ , • . . ,aR  we  get 

Q  -  mln{(-~*  l  (a.-e,)2  +  I  a2}  (3.2) 

a1  (♦-0)2  1  t 

where  for  t  ”  2...n,  at  »  +  There  is  some  computational 

advantage  in  defining,  somewhat  artificially,  eQ  ■  aQ  ■  (a^-e^ )/(♦—© )  which 
allows  t  to  run  from  1  to  n,  and  the  minimization  to  be  carried  out  wrt 


*0* 

Furthermore, 


M 


(1“<T) 


u-Q)2  (i-o2n) 

(i-*2)  (i-e2) 


for  0 


1  + 


n( !-♦)/( !+♦)  for  0 


1 

(3.3) 


This  example  will  be  useful  in  following  illustrations. 

» 

Ose  of  the  exact  likelihood  function  L(a,0,c|y)  *  f(y|a,0,o)  has  now 

been  supported  by  several  estimation  studies  in  the  case  of  univariate  models 

without  regression  components,  e.g.  by  Ansley  and  Newbold  (1980)  who  do  not 

even  introduce  a  constant  term  in  the  model.  When  regression  components  are 

present,  inferences  about  o  for  any  given  0  can  be  carried  out  by  direct 

maximization  of  L  using  generalized  least  squares,  since  e  is  linear  in 

'  2 

a.  This  yields  an  estimate  a  with  dispersion  matrix  a  D,  where  D 

2 

depends  only  upon  x  and  0  -  we  shall  write  D(0).  Thus  given  0  and  a  , 
(o-o)  is  MVH(O,o2D(0)). 
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s; 


A2  A  A 

Furthermore,  maximization  wrt  o  gives  a  =  Q(a,0)/n,  where  Q(a,0) 

is  the  minimum  of  Q  w.r.t.  a.  For  later  convenience  we  replace  n  by 

A2 

n-K  in  the  definition  of  0  ,  and  this  does  not  affect  the  result  that  the 
maximized  likelihood  is  now 

L<0)  «  M_1/2Q_n/2  -  {MVnQ}*n/2  .  (3.4) 

We  shall  call  the  quantity  M1^}  the  deviance,  Dev(0).  Zt  may  be 

regarded  as  an  objective  function  to  be  minimized  in  the  estimation  of  0  by 

maximum  likelihood.  In  general,  8  is  a  'nonlinear'  parameter  so 

optimization  methods  are  required.  The  strength  of  maximum  likelihood  rests 

very  much  on  its  proven  reliability  in  many  circumstances  for  comparisons 

within  a  parametric  family  of  models.  We  now  reconsider  the  problem  from 

section  2  by  viewing  the  two  models  as  cases  of  a  common  parametric  model: 
Model  3  yfc  ■  c  +  bt  +  e^  :  Ve^  *  (1-0B)at 

or  Vy^  -  b  +  (1-0B)afc  . 

The  IMA( 1,1)  model  for  the  error  allows  Model  1  in  the  case  6  *  1  and  Model 
2  in  the  case  0  *  0.  Using  formula  (3.3)  with  4>  *  0  then  gives  the  general 
likelihood  from  which  we  get  (replacing  n  by  n-1 ) 

for  Model  1  (0*1)  Dev  *  n1y^n  1  ^RMS  1 
for  Model  2  (0*0)  Dev  «  RMS2  . 

We  now  note  a  penalty  attached  to  Model  1  which  was  not  present  when  we 
evaluated  the  likelihood,  i.e.  BMS1,  without  the  differencing  feature  in  the 
error  structure.  This  penalty  exists  despite  the  fact  that  the  differencing 
is  'cancelled*  by  the  moving  average  operator  when  0*1. 

The  introduction  of  this  penalty  was  noted  by  Harvey  (1980)  when 
considering  the  problem  of  discriminating  between  regressions  in  levels  and 
first  differences.  He  used  a  general  regression  variable  xt  in  place  of  the 
trend  t,  which  we  shall  see  is  at  the  root  of  the  problem  in  this  case. 

-8- 
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The  penalty  n^n~  when  applied  to  RMS1  helps  to  meliorate  the  bias 
in  the  comparison,  but  does  not  overcome  it  until  n  *■  10.  Noting  that  Model 
3  effectively  eliminated  the  parameter  c,  we  continue  by  imposing  a  further 
differencing  structure  on  the  error,  to  eliminate  b; 

Model  4  y^  *  c  +  bt  +  efc  j  V2et  *  (1-0,|B-82B2)at 
or  V2yfc  =  (1-01B-82B2)at  . 

The  XMA(2,2)  model  for  the  error  now  allows  Model  1  in  the  case  8^  *  2, 

02  ”  -1,  and  Model  2  in  the  case  =  1,  @2  =  0.  It  may  now  be  shown  that 
the  exact  likelihoods  corresponding  to  these  are  given  by  the  mean  deviances 

for  model  1  Dev/(n-2)  -  {n2(n2-1)/12}1/(n_2)RMSl 

for  model  2  Dev/(n-2)  *  (n-l)1^”  2^RMS2 

This  gives  a  net  penalty  factor  of  {n2(n+1 )/12}*^n  2*  against  Model  1. 

With  these  factors  we  can  show  that 

E(Dev  1)  -  r  E(Dev  2)  when  model  2  is  true 
and  E(Dev  2)  *  s  E (Dev  1)  when  model  1  is  true 

where  both  r  and  s  are  greater  them  1  for  n  >  3.  When  n  *  3, 

Dev  1  =  Dev  2.  This  criterion  for  model  selection  now  seems  satisfactory  as 
Table  2  shows. 


1 


5 

6 

1.083 

1.142 

1.077 

1.121 

329  1.407 

232  1.263 


Table  2:  Ratios  of  expected  deviances  under  models  1  and  2 


Even  for  large  n  the  above  net  factor  ia  important,  implying  that  RMS1 
should  be  inflated  by  12%  when  n  *  100  and  7%  when  n  ”  200. 

4.  MARGINAL  LIKELIHOOD 

The  device  of  overdifferencing  that  seemed  to  solve  the  above  problem 
evidently  worked  because  it  removed  the  question  of  regression  parameter 
estimation.  Very  similar  problems  in  the  estimation  of  variance  components 
models  have  been  treated  by  using  marginal  likelihood,  and  Cooper  and  Thompson 
(1977)  have  applied  these  ideas  to  time  series  models.  The  fundamental  idea 
is  to  evaluate  the  likelihood  of  the  ARMA  error  structure  parameters  P , 
using  only  that  information  in  y  which  is  invariant  to  any  changes  in  the 
regression  coefficients  a.  This  is  what  differencing  achieved  in  our 
problem.  In  general  the  marginal  likelihood  can  be  evaluated  in  simple  steps, 
for  a  proposed  parameter  value  0. 

A 

(i)  Estimate  a,  obtain  Q(a(0)  and  N  *  1/det  D(0) 

(ii)  The  marginal  deviance  is  Mev(0)  *  {NM}1y^n  K^Q  where  M  is  the 
same  factor  as  appeared  in  the  exact  likelihood.  In  practice  N  is  evaluated 

A 

as  the  determinant  of  the  matrix  appearing  in  the  GLS  equations  for  u,  so  is 
obtained  as  a  by  product  of  solving  these  equations* 

The  expression  given  by  Cooper  and  Thomson  is  equivalent  to  Mev(0) 
apart  from  the  fact  that  they  effectively  scale  N  =  N(0)  by  dividing  by 
N(0).  The  advantage  is  that  Mev  is  then  invariant  to  a  reparameter ization 

A*  A*  •  ^ 

of  the  regression  by  setting  say  o  3  Pa,  x  =  xP  .  We  shall  omit  this 
scaling,  though  bearing  it  in  mind  if  we  are  tempted  to  compare  models  with 
different  regressors.  The  advantage  is  that  as  defined  in  (ii),  Mev  is 
invariant,  apart  from  the  usual  Jacobian,  to  linear  data  transformations 

a  a 

(y,x,e)  =  R(y,x,e).  Differencing  and  other  simplifying  operations  such  as  are 


M 


considered  by  Abraham  and  Box  (1978)  are  the  moat  usual  of  such 
transformations • 

Applying  the  above  definition  of  Mev  in  the  case  of  Models  1  and  2 
leads  very  simply  to  the  factors  modifying  RMSl  and  RMS2  as  the  determinants 
of  the  least  squares  equation  matrices  in  (2.3)  and  (2.4). 

the  expression  for  the  Marginal  Likelihood  is  obtained  by  the  usual 

*  A2  2 
arguments.  Because  a  and  a  are  sufficient  for  a  and  o  (and  are 

independent),  the  marginal  likelihood  which  represents  information  in  y 

invariant  both  to  translations  of  a  and  scale  change  of  a,  is  given  by 

A  A 

f<y|a#B#0)« 

Using  the  fact  that  (a-a)/o  is  MVN [0, D(S)1  and  (n-K)o2/o2  is  chi- 

A  A 

squared  on  (n-K)d.f.,  we  can  factor  out  the  densities  of  a,  6  from  (3.1) 
to  obtain 

f(y|a,fi,o)  «  (Mevr(n”K>/2  .  (4.1) 

A  A 

We  shall  also  use  the  fact  that  (a-a )/o  has  a  multivariate  T  distribution 
with  matrix  parameter  D(0),  or  MVT[D(0)]. 

The  derivation  by  Cooper  and  Thompson  is  somewhat  different,  and  the 
above  corresponds  more  closely  to  that  presented  by  Levenhach  (1972)  for  auto¬ 
regressive  models.  The  use  of  marginal  likelihood  in  this  kind  of  situation 


J 


is  in  accord  with  the  ideas  of  Kalbfleisch  and  Sprott  (1970).  Cooper  and 
Thompson  also  mention  that  it  is  possible  to  derive  (2.5)  by  Bayesian 
arguments.  Taking  the  usual  uniform  priors  on  a  and  log  o  gives  the  same 
result  after  integrating  out  a  and  a.  It  is  then  useful  to  Interpret  (4.1) 
as  f ( 0 |y) ,  the  posterior  density  for  0  given  a  uniform  prior.  The  same 


distributional  property  as  above  still  holds  for  (a-o)/o,  but  now  with  the 
interpretation  that  it  specifies  f(a|0,y).  Zellner  and  Tlao  (1964)  present 
the  Bayesian  approach  for  the  case  of  AR(1)  error  structure. 
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y  v  */  • .  • 


Because  Mev ( 6 )  *  {nm} 


1/(n  Q(a,g)#  and  neither  H  nor  M  depend 

a 

on  a,  it  is  tempting  to  take  NM1/^n  *^Q(a,&)  as  an  overall  objective 
function  to  be  minimised  wrt  a  and  0.  Although  this  may  be  useful  from  a 
computational  viewpoint,  the  interpretation  of  this  function  is  difficult  and 
its  use  in  tests  which  simultaneously  involve  a  and  0  is  not 
recommended.  He  shall  however  be  so  bold  as  to  use  contours  of  Hev(B)  to 
define  confidence  regions  and  carry  out  mean  deviance  tests  by  referring 

a  A  a 

{Mev(0)-Mev(B)}/{Mev(P)/(n-K-L)}  to  chi-squared  on  I*  d.f.  where  0  is  the 
minimum  deviance  estimate  of  6,  and  L  is  the  number  of  ARIMA  model 
parameters.  If  differencing  is  present  in  the  ARIMA  model,.  K  includes  the 
reduction  in  error  series  length  due  to  the  differencing. 

5.  CONSTANT  TERM  ESTIMATION 

Considering  the  error  models  efc  *  afc  (random  error)  and  Vefc  *  afc 

(random  walk  error)  as  two  extremes,  it  is  possible  to  bridge  these  extremes 

by  either  of  two  simple  models,  the  AR(1)  model  efc  *  ^et-1  +  at'  4113  the 

IMA (1,1)  model  Ve.  *  a.  -  0a.  . .  When  $  *  0  or  0*1  we  obtain  the 
t  t  t~1 

random  error,  when  $  *  1  or  0*0  we  obtain  the  random  walk. 

Introducing  the  constant  term  in  the  AR(1)  case  corresponds  to  a  constant 

term  in  the  regression.  Introducing  a  constant  in  the  IMA(1,1)  case  is 

equivalent  to  a  trend  regression  for  y,  i.e.  yfc  *  ct  +  efc,  whence 

Vyfc  *  c  +  afc  -  0a fc_ 1 .  Such  models  occur  frequently  in  practice. 

The  marginal  likelihood  for  AR(p)  models  with  a  constant  terra  was 

considered  by  Levenbach  (1972).  He  gave  some  explicit  formulae  and  contour 

maps  for  the  marginal  likelihood.  In  the  case  of  the  AR(1)  model  the  contrast 

2  “1  /n 

is  between  Dev  *  ( 1-$  )  Q  for  the  exact  likelihood,  and  Mev  * 
{l+n(1-4)/(1+$)}  1  for  marginal  likelihood.  The  factor  in  the  former 


is  well  known,  and  corresponds  to  (2.3)  when  0  *  0.  The  factor  in  the  latter 
nay  be  got  by  the  device  of  taking  the  exact  likelihood  of  the  differenced 
series,  and  thus  corresponds  to  (2.3)  when  6-1.  The  main  point  to  note  is 
that  the  exclusion  of  ♦  -  1  in  the  exact  likelihood  is  absent  in  the 
marginal  likelihood.  A  valid  comparison  of  the  random  walk  model  correspond¬ 
ing  to  ♦  -  1 ,  can  then  be  made  with  the  AR(  1 )  model  with  4  <  1.  In  the 

n  , 

limit  as  4  ♦  1#  Mev  ♦  £  (Vy  )  ,  but  there  is  numerical  instability  in  the 

2 

calculation  of  Mev  from  the  AR( 1 )  model  as  $  +  1 .  Both  in  the  calculation 

A 

of  c  which  has  a  well  defined  limit  of  (y^  +  yn)/2,  and  of  the  product 
MM,  ratios  of  quantities  with  the  common  factor  (1-+)  occur.  It  is 
possible  for  Mev  to  be  decreasing  with  strictly  negative  gradient  up  to 

2  n  2 

♦  •  1.  In  fact  at  ♦  “  1,  3 log  Mev/3$  *  1/2 {1  -  (y  -y  )  /\  (Vy  )  }. 

n  i  2  ^ 

Although  this  has  expected  value  zero  if  yt  is  indeed  a  random  walk,  it  may 
be  negative  if  yfc  has  a  strong  trend. 

The  marginal  likelihood  for  the  MA(1)  model  with  constant  term  was  the 
subject  of  simulation  studies  by  Cooper  and  Thomson  (1977).  The  quantity 
Dev  in  this  case  has  as  the  multiplier  of  Q,  the  factor  M^n,  where  using 
(2.3)  with  ♦  -  0,  we  have* 

M  -  (1-02n+2)/(1-02)  for  62  1,  and  M  -  n  for  02  -  1  . 

The  multiplier  of  Q  in  Mev  is  (MN)^n-^  where 
/" 

ln(1-02n+2)/(1-02>  -  (1-0n+1)2/{1-0)2]/(1-0)2,  for  02  <  1 

MN  -  n2(n2-1)/12  ,  for  0  -  1 

2  2 

n  /4  when  n  is  odd,  (n  -1)/4  when  n  is  even,  for  0  *  -1 


Symmetry  considerations  show  that  both  Dev  and  Msv  always  have  stationary 
points  at  the  boundaries,  8  ■  ±1#  so  that  minimum  Dev  or  Mev  values  can 
occur  on  the  boundaries* 


Mote  that,  as  for  the  AR{ 1 )  nodal,  the  value  of  MN  is  n  at  the  origin 

i.e.  6*0  in  this  case,  so  the  effect  on  minimum  Mev  estimation  is  seen 

by  comparison  with  this  value.  The  value  6*1  is  ouch  more  strongly 
discouraged  by  MM  than  by  M  alone,  whereas  6  *  -1  is  slightly  less 
discouraged. 

For  both  the  AR(1)  and  MA( 1 )  models,  the  use  of  marginal  likelihood 
counteracts  the  tendency  of  mean  correction  to  bias  sample  autocorrelation 

downwards.  For  the  MA(1)  model  with  0  <  6  <  1  it  is  known,  see  Cryer  and 

A  A 

Ledolter  (1981),  that  the  estimates  6  have  a  tendency  to  pile  up  at  8*1 
even  when  the  mean  level  is  not  estimated,  but  set  to  its  true  value.  Thus 
Cooper  and  Thompson  report  that  out  of  200  simulations  of  a  series  of  length 

A 

n  *  59  with  8  -  .7539,  the  value  8*1  occurred  8  times.  When  the 
series  mean  ms  estimated,  still  using  exact  likelihood,  this  leapt  to  65 
times  out  of  the  200.  When  marginal  likelihood  was  used  it  fell  back  to  19 

times. 

The  distinction  between  8*1  and  8  <  1  is  important  in  the  model 

yt  *  yt_1  +  c  +  afc  -  ®at-l  wkich  has  a  structural  interpretation  for  yt  as 

a  signal  plus  noise,  the  signal  being  a  random  walk.  Thus  yfc  =  wfc  +  nfc 

where  wfc  ■  wfc-1  +  c  +  and  nfc  -  3fc,  both  and  being  independent 

2  2 

Normal  with  zero  means  and  variances  0  ,  o. .  The  variance  ratio 

a  p 

2  2  2 

r  *  °a/°0  A*  related  to  8  bjf  (1-8)  /8  ■  r,  the  value  8*1  correspond¬ 
ing  to  absense  of  the  random  walk  component,  apart  from  a  fixed  increase  c 
per  unit  time.  If  8  ■  .7539  then  r  ~  1/12  and  over  a  series  length  60 
the  random  walk  component  will  build  up  a  variance  of  5  times  the  noise 
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level,  and  should  be  detectable.  The  simulation*  of  Cooper  and  Thoapson  show 
how  frequently  the  wrong  conclusion  that  0  ■  1  can  arise  in  this  situation 
when  c  is  estimated,  unless  Marginal  likelihood  is  used. 

He  now  move  on  to  consider  particular  data  sets  which  illustrate  how  the 
use  of  marginal  likelihood  can  affect  inference. 

6.  ESTIMATION  OF  SEASONALITY 

We  consider  the  selection  between  a  Model  which  represents  seasonality 
using  fixed  seasonal  coefficients  applied  to  indicator  variables,  and  one  in 
which  the  Box-* Jenkins  seasonal  model  is  used.  A  fixed  linear  trend  is  also 
estimated  in  both  models.  We  shall  use  the  series  of  144  points  of  the 
logarithms  of  airline  passenger  totals  as  analysed  by  Box  and  Jenkins  (1976) 
and  shown  in  Figure  2. 


Figure  2:  Airline  Passenger  Totals 


Let  us  take: 


S  *  1  in  month  j,  else  is  0. 
J 


so  that  in  fact  model  (ii)  with  8-1  is  in  theory  identical  to  (i). 

The  error  efc  from  an  OLS  fit  of  model  1  is  evidently  strongly  auto- 
correlated  and  as  a  first  step  we  represent  it  by  an  AR(1)  model, 

(1-4B)et  -  Then  model  (i)  is  a  fairly  classical  decomposition  model. 

The  results  of  fitting  these  models  using  exact  likelihood  are  summarized 
in  rows  1  and  2  of  Table  3.  We  note  that  the  minimum  Dev  occurs  with  the 
fixed  seasonality  model  1.  Reports  to  the  effect  that  snch  fixed  seasonality 
models  fit  better  than  Box- Jenkins  models  have  been  heard  by  the  author  for 
same  years.  The  fitting  criterion  has  usually  been  least  squares,  i.e. 
minimisation  of  Q  rather  than  Dev  -  M^nQ,  but  this  is  very  similar  to 
exact  likelihood  at  parameter  values  such  as  the  above,  which  are  reasonably 
well  distant  from  the  boundaries,  nevertheless  soon  puzzlement  has  resulted, 
because  on  the  L.S.  criterion  model  2  is  more  general  than  model  1,  and  should 
obtain  a  residual  sum  of  squares  at  least  as  good  as  model  1.  The  explanation 
is  that  the  least  squares  surface  for  model  2  has,  besides  a  local  minimum  at 

A 

0,  a  sharp  dip  at  6  -  1  to  what  in  this  case  Is  a  lower  point  corresponding 
to  model  1.  The  exact  likelihood  multiplier  M  fills  this  dip,  and  rather 
fortuitously  the  back-forecasting  method  of  Box  and  Jenkins  if  insufficiently 
Iterated  (as  in  the  early  software)  has  a  similar  effect  which  avoids  8-1. 
Marginal  likelihoods  however  give  proof  to  support  the  use  of  Box-Jenkins 
seasonal  models  which  may  formerly  have  required  some  faith. 


12 

model  Ii  y  -  bt  +  l  M.S.  + 
t  3  3,t 

where  for  j  -  1...12, 

model  2s  y  »  bt  ♦  *  e 

*  (1-B  )  Z 
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Model 

Dev  or  Mev 

3 

S 

RDF 

b 

A  A 

SE(b) 

A 

♦ 

A 

0 

A 

e 

1 

.17721 

.001354 

130 

.00999 

.00032 

.787 

2 

.18874 

.001401 

129 

.00996 

.00058 

.779 

.577 

3 

.18421 

.001416 

130 

.00950 

.00238 

.262 

4 

.18296 

.001369 

129 

.402 

.557 

KM) 

.23901 

.001354 

130 

.00999 

.00035 

.803 

2(M) 

.20110 

.001403 

129 

.00995 

.00063 

.792 

.568 

3(M) 

.24009 

.001416 

130 

.00949 

.00233 

.261 

Table  3t  Summary  of  models  fitted  to  the  'airline  data' 


Fitting  Models  1  and  2  by  marginal  likelihood  criteria  gives  the  results 
shown  in  rows  1(M)  and  2(m)  of  Table  3.  The  picture  has  changed  considerably 
and  model  2(M)  is  now  much  more  strongly  favoured  than  1(M).  Although  the 
parameter  values  are  little  changed,  a  heavier  'penalty'  has  been  introduced 
to  model  1 (M) ,  arising  from  the  use  of  seasonality  indicators. 

We  continue  by  considering  the  model  which  was  actually  used  by  Box  and 
Jenkins  for  this  data.  It  involved  a  further  stage  of  differencing  by  taking 
for  the  error  structure 


iiiSfil. 

(1-B)  t 


Using  this  in  models  1  and  2  gives 


model  3j  Vy.  -  b  ♦  ZM.S,  .  ♦  (1-0B)a.  t 

t  t 

model  4:  VV^y^  «  (1-0B)(1-0B1*)at  ; 


the  last  one  being  of  course  the  Box-Jenkins  model 


Estimation  by  axact  likelihood  gave  the  results  in  rows  3  and  4  of  Table 


A 


II 


3.  Although  the  Box-Jenkins  model  4  is  now  slightly  preferred  to  models  2  and 
3,  it  still  takes  second  place  to  model  1.  Again  however,  looking  at  the 
results  for  marginal  likelihood,  we  see  that  model  4  -  for  which  exact  and 
marginal  likelihood  coincide  -  is  now  much  to  be  preferred.  As  a  check,  model 
4  was  refitted  with  6  constrained  to  1.0,  which  should  be  equivalent  to 
model  3(M)  and  did  in  fact  give  the  same  value  of  Mev  and  estimate  of  6. 

A 

Mote  incidentally  how  little  the  trend  estimate  b  in  Table  1  changes  through 
models  1,  2  and  3,  though  its  standard  error  estimate  changes  substantially. 


7.  DISCRIMINATING  AR  ROOTS  OP  UNITY 

To  illustrate  the  problem  we  first  look  at  a  data  set  of  200  points  of 
the  series  of  monthly  unemployment  in  Scotland  from  January  1952  to  August 
1968,  as  shown  in  Figure  3.  Two  competing  models  are,  after  applying  a 
fourth-root  transformation  to  the  data, 

model  1  d-+1B-+2B2){V12yt-c}  •  ( 1-0B)  ( 1-©B12)at 

model  2  (H,B)7T  y  -  (1-6b)  (1-0B12U  . 


0  18  24  36  46  60  72  84  96  106180138144196168160198804 
MONTHS 


Figure  3:  Unemployment  in  Scotland. 


2 

In  nodal  1,  as  a  root  of  (l-^B-^B  )  approaches  1,  i.e. 

+  #2  ♦  1,  the  exact  likelihood  becomes  0,  or  the  Deviance  infinite.  In 
this  limit,  model  1  becomes  model  2,  but  although  an  exact  likelihood  for 
model  2  can  be  defined  on  the  basis  that  VV^y^  is  stationary,  this  is 
evidently  not  directly  comparable  with  that  for  model  1.  However,  because  of 
the  presence  of  the  constant  c  in  model  1,  the  marginal  likelihood  may  be 
used  in  this  comparison,  since  it  is  based  on  the  same  information,  i.e.  that 
in  Th®  results  from  models  1(M)  and  2  in  Table  4  may  be  used  in  a 

'mean  deviance*  test  by  referring 

(1.107046  -  1.067656)  x  183/(1.067656)  -  6.752 
to  the  chi-squared  distribution  on  1  d.f.,  from  which  it  might  reasonably  be 
inferred  that  model  1  is  to  be  preferred,  despite  the  fact  that 

A  A 

41  ♦  4.  ”  .988  ~  1.  This  is  to  a  large  extent  confirmed  by  superior  fore¬ 

casts  produced  by  model  1  over  the  next  few  years.  The  operator 
*  -  2 

d-tjB-^B  )  corresponds  to  a  spectrum  which  peaks  around  a  period  of  5  years 
and  results,  in  the  long  term,  in  a  forecast  function  with  a  damped  cyclical 
pattern.  Model  1  is  similar  in  the  short  term,  but  does  not  imply  a  turning 
point  in  the  trend  of  its  forecasts. 


Model 

” 

Dev/Me v 

RMS 

- 1—: — 

MW  |  ♦, 

1 

A 

*2 

A 

e 

A 

e 

- 1 

A 

c 

1 

r 

1.061129 

.005339 

183  i  1.9128 

-.9250 

.6804 

.7727 

.0757 

KM) 

1.067656 

.005351 

183 

1.9086 

-.9201 

.6656 

.7664 

.0766 

2 

[1.107046  j. 005644 

184 

.8709 

.5381 

.7909 

i 

Table  4:  Suamary  of  models  fitted  to  unemployment  data 
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It  may  be  objected  that  the  exact  likelihood  in  row  1  of  Table  4  is  only 
slightly  less  than  the  siarginal  value  in  row  1 (M) ,  and  would  lead  to  the  same 
conclusion.  To  some  extent  this  is  coincidence  -  even  at  the  origin  of  the 
ARMA  parameters ,  the  definition  of  Mev  is  some  3%  greater  than  that  of 
Dev.  A  further  example  in  this  section  shows  that  the  exact  likelihood  for 
the  'undifferenced'  model  can  exceed  that  for  the  differenced  model,  so  that  a 
'mean  deviance'  quantity  is  negative  and  evidently  meaningless. 

In  this  next  case  then,  we  examine  150  points  of  annual  measurements  of 
day  length  variations  as  given  by  Luo,  et  al  (1977)  and  shown  in  Figure  4. 

The  data  has  been  smoothed,  which  introduces  MA  terms  in  the  model,  but  also 
has  very  much  the  appearance  of  a  Random  Walk  (R.W.)  and  the  model  choice 
reflects  this  possibility 

model  1  (1-41B-^2B2)(yt-c)  *  (l-S^B-SjB^a^^ 

model  2  O-^BjVy^  -  (l-O^B-OjB2^  . 


Figure  4:  Changes  in  day length 


Model 

Dev/Mev 

mm 

RDF 

A 

*1 

A 

♦2 

A 

81 

A 

C 

1 

37990.88 

250.48 

145 

1.3881 

-.4158 

-.5366 

-.6148 

1(M) 

37044.50 

250.55 

145 

1.3872 

-.4034 

-. 5440 

-.6173 

2 

37154.62 

251.44 

146 

.3869 

-.5542 

-.6207 

T»bl<  5 1  Su— ary  of  Models  1  and  2  fitted  to  the  day  length  data 


We  first  note  that  Dev  for  model  1  exceeds  that  for  model  2  (for  which  DEV 
■  MEV) .  The  mean  deviance  test  statistic  from  the  results  in  rows  1(H)  and 
2  is  now  0.431  and  we  might  conclude  that  there  is  insufficient  evidence  to 
reject  the  RW  type  model  2.  We  do  however  take  a  further  look  at  this  data  in 
section  9. 

We  now  re-examine  the  peak-sunspot  data  of  Figure  1,  taking  the  model 

yt  *  c  +  bt  +  dxfc  +  et  j  t  »  1 ... 25  .(7.1) 

where  xfc  is  the  corresponding  sequence  of  sunspot  numbers  at  the  peaks,  and 
we  take  an  AR(1)  error  model  efc  •  ♦e^  +  afc.  Interest  lies  mostly  in  the 
estimate  of  b  and  its  precision.  When  $  ■  0,  i.e.  with  random  error,  we 

A 

have  b  ■  11.136  (SE.042)  and  RMS  2.094  whereas  with  ^  *  1,  i.e.  random  walk 
error,  we  have  11.051  (SE.302)  and  RMS  2.181.  Estimating  4  by  exact  like- 

A  A 

lihood  gives  ♦  “  .472  (SE  .216)  and  b  ■  11.113  (SE  .065),  but  using  marginal 

A  A 

likelihood  ^  ■  .625  (SE  .217)  and  b  *  11.102  (SE  .085).  In  such  a  situation 
the  use  of  a  point  estimate  for  t  does  not  give  the  fuil  picture,  and  in 
Figure  5  we  show  the  marginal  likelihood  function  for  $  together  with  the 
estimates  of  b  and  of  its  standard  error.  Although  $  ■  0  seems  definitely 
ruled  out,  ♦  ■  1  still  appears  to  be  a  possibility.  Taking  a  Bayesian 
interpretation  of  the  marginal  likelihood  as  the  marginal  density  of 

A  A 

♦,  i.e.  f(+|y),  and  taking  the  t22  density  of  (b-b)/SE(b)  as  specifying 
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the  density  f (b | 4,y)  we  can  numerically  construct  the  posterior  for  b  as 


/  f (b|4,y)f (+|y)d+.  This  is  shown  in  Figure  6.  The  mode  is  at  b  *  11.115, 
a  95%  confidence  interval  for  b  is  (10.815,  11.325)  and  a  99%  interval  is 
(10.62,  11.49).  These  are  much  wider  than  those  found  by  taking  4  fixed 
at  .625.  Further  studies  reveal  that  points  7,  8  and  9  of  the  data  have  a 
considerable  Influence  and  other  error  models  which  'follow*  these  closely, 
e.g.  MA(2 ) ,  give  a  somewhat  different  answer.  The  answer  to  Dicke's  question 
may  therefore  not  be  clear,  but  it  usefully  illustrates  the  statistical 
considerations . 


ESTIMATION  OF  SUNSPOT  PERIOD 
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Figure  5:  Marginal  likelihood  (solid  line),  estimated  period 
of  sunspot  peaks  minus  ten  years  (long  dashed  line) 
and  standard  error  of  the  period  (short  dashed  line), 
as  functions  of  the  autoregressive  parameter. 


MARGINAL  DENSITY  OF  SUNSPOT  PERIOD 
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Figure  6:  Marginal  poatarlor  density  of  the  sunspot  period 

8.  INTERVENTION  MODELLING 

A  strong  argument  for  the  use  of  marginal  likelihood  may  be  found  in  the 
treatment  of  missing  observations  from  time  series  data,  and  the  effect  on 
ARIMA  parameter  estimates.  This  can  be  cast  as  a  regression  problem  by  using 
Intervention  analysis  techniques  -  see  Box  and  Tlao  (1975)  -  and  we  extend  our 
consideration  to  such  simple  intervention  patterns  as  step  changes  in  the 
level  of  a  series. 

To  appreciate  the  possible  effect  consider  a  series  for  which  a  simple 
MA( 1 )  model  is  appropriate,  and  is  fitted  using  the  exact  likelihood 


criterion: 


Dev  -  {(1-e2n+2)/(1-82)}1/nQ(8) 


(8.1) 


where  Q(0)  ■  min  S(6|y)  and  the  sum  of  squares  S  is  given  by 
*0  n  2 

S(8|y)  “  £  a  where  a  ■  y  +  9a  -  for  t  *  1...n 

t-0  t  t  t-i 

Suppose  now  is  missing  for  some  0  <  r  <  n-1.  It  may  be  treated 

by  introducing  a  regression  (intervention)  variable  It  which  is  an  impulse 

A  A  A 

at  t  ■  r+1,  after  first  setting  yr+1  *  0.  Then  y  =  -u  where  a  is 
the  estimated  regression  coefficient. 

It  is  easily  appreciated  that  in  this  case  we  can  instead  take  yr+1 ,  or 
equivalently  a^.,  as  the  nuisance  parameter  so  that  Q(8)  =  min  S(6|y) 


in  the  definition  of  Dev. 


a0'*r+1 


Now  note  that  if  yr+f  is  missing,  we  have  for  this  model  two 
independent  samples  y1 ,  •  •  •  ,yr  and  y^j,  • • • ,yn,  of  lengths  r  and  s  ■ 
n-r-1,  so  that  the  true  likelihood  of  the  actual  observations  (forgetting  the 
missing  value)  is  found  as  the  product  of  the  likelihoods  for  the  two  parts. 
This  leads  to  the  following  expression  for  the  deviance  which  must  of  course 
equal  the  marginal  deviance  from  the  regression  formulation,  since  it  is  based 


on  just  the  available  observations. 


Mev  -  {(1-02r+2)(1-828+2)/n-02)2>1/(n"1)Q(0) 


(8.2) 


where  Q(0)  is  exactly  as  defined  above. 


Incidentally,  the  interpolated  value  is  y^+ 1  *  ar+1  “  ®ar  *  yp  +  y 
where  y^  is  the  forecast  of  y^^  based  upon  the  first  part  of  the  data,  or 

A 

the  past  wrt  yr+1 ,  and  yp  is  the  back  forecast  based  on  the  future  part  of 
the  data.  This  is  of  course  a  direct  consequence  of  the  independence  of  the 
two  parts  for  this  model. 

Assuming  that  Mev  is  calculated  from  the  intervention  regression 


formulation,  the  factor  N  in  the  general  expression  for  Mev,  converts 
Dev  in  (8.1)  into  Mev  in  (8.2)  for  this  example.  Note  that  e.g.  for  a 
series  of  length  n  “  100,  a  missing  value  at  t  *  60  increases  the 


multiplier  of  Q  from  1.047  for  Dev  to  1.082  for  Mev  at  0*1 


A  more  commonly  occurring  situation  which  is  precisely  equivalent,  is 
when  a  step  function  is  introduced  to  model  a  change  in  level  at  time  r+1  in 
a  series  which  follow  an  IMA(1,1)  model,  i.e. 


S.  .ifcsm 


V  *t 


t 

where  St  *  0  for  t  <  r  and  1  for  t  >  r.  On  differencing,  we  get 
Vy  *  1^  +  (1-0B)at,  the  model  just  considered. 

We  now  have  the  interpretation  that  in  fitting  such  a  step,  one  is 


removing  some  evidence  for  a  random  walk  (signal)  component  in  y^,  as 


discussed  in  section  5.  The  use  of  marginal  likelihood  acts  against  the 
corresponding  tendency  for  0  to  shif t  towards  1 . 

As  an  illustration,  consider  the  slightly  more  complex  case  of  122  values 
from  the  latter  part  of  the  Beverage  wheat  price  index  as  shown  in  Figure  7. 

An  IMA( 1,3)  model  appears  sensible  for  this  data  after  a  logarithmic  transfor¬ 
mation,  and  a  constant  term  is  included  because  of  the  firm  positive  trend 
over  the  historical  record  of  this  series.  At  points  t  *  70  and  71  in  our 
subseries,  there  appears  to  be  a  general  shift  down  in  the  series  level, 
following  a  slightly  high  value  at  t  *  69.  Three  interventions  are 
introduced*  an  Impulse  at  t  *  69  and  steps  at  each  of  t  *  70,  71.  This  has 
the  effect  of  rendering  the  differenced  section  from  t  *  2... 68  independent 
of  that  from  t  *  72... 122,  given  the  model.  From  Table  6  it  is  seen  that 


without  the  intervention  estimation  the  steps  down  are  taken  as  evidence  for 


2  3 

the  RW  signal  component,  and  there  is  no  hint  that  0(B)  *  l-O^B-O^  -03b 


has  a  root  of  B  ■  1.  Estimations  by  Dev  in  row  1  or  Mev  in  row  1(M)  are 
then  similar.  However,  estimation  of  the  three  intervention  effects  by  exact 
likelihood  as  shown  in  row  2,  leads  to  a  root  of  0(B)  at  one,  implying 
absence  of  any  random  walk  component.  The  use  of  marginal  likelihood  counters 


this  effect  ee  seen  in  row  2(M).  Although  standard  errors  are  not  given, 

A  A  A 

the  t  values  were  1.82  for  X69,  1.09  for  S70  and  2.78  for  S71. 
the  only  real  effect  seems  to  be  associated  with  S71 ,  although  the  drop 
from  Ygg  to  appeared  large. 


Figure  7:  Beveridge  Wheat  Price  Index 


Model 

Dev/Mev 

RMS 

RDF 

A 

A 

92 

A 

03 

1-E0. 

X 

B 

A 

169 

S70 

<  M 

1 

2.0971 

.01782 

117 

-.059 

.406 

.314 

.339 

.006 

1(M) 

2.2189 

.01784 

117 

-.073 

.390 

.303 

.380 

.006 

2 

1.7284 

.01461 

114 

.091 

.516 

.393 

.000 

.011 

.225 

-.155 

-.346 

2<M) 

1.8864 

.01513 

114 

.038 

.459 

.341 

.162 

.011 

.217 

-.160 

-.328 

Table  6:  Summary  of  models  fitted  to  the  wheat  price  index. 
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It  must  be  remembered  that  these  conclusions  are  affected  by  the  choice 
of  sample  size.  For  short  samples  it  will  be  more  difficult  to  detect  the  RW 
component  even  using  Mev,  and  it  wili  be  easy,  even  using  Dev,  in  large 
samples.  It  can  be  expected  however  that  over  a  good  range  of  sample  sizes, 
Mev  may  be  found  useful, 

9.  MIXED  SPECTRUM  ESTIMATION 

The  search  for  deterministic  sinusoidal  components  in  time  series  is 
still  of  great  interest  in  many  fields.  The  significance  of  such  a  discrete 
component,  at  a  given  frequency,  depends  upon  how  prominent  the  sample 
spectrum  is  at  that  point  in  comparison  with  the  general  level  in  the 
surrounding  frequency  band,  which  is  supposed  to  arise  from  a  continuous 
spectrum.  If  interest  lies  in  a  single  component  at  a  well  defined  frequency, 
well  away  from  the  ends  of  the  frequency  range,  then  the  task  is  quite 
reasonable  since  the  continuous  part  of  the  spectrum  should  be  adequately 
represented  and  well  estimated  using  an  ARMA  model  -  see  Campbell  and  Walker 
(1977).  If  however  there  are  several  discrete  components  covering  a  frequency 
band,  their  estimation  may  deplete  the  sample  spectrum  over  this  band  and 
distort  estimation  of  the  model  of  the  continuous  spectrum.  This  possibility 
is  accentuated  if  the  end  of  the  frequency  range  is  affected,  because  the  ARMA 
model  is  required  to  extrapolate  rather  than  to  interpolate  the  missing  part 
of  the  spectrum,  which  must  be  estimated  if  any  reasonable  inference  is  to  be 
made.  If  a  very  flexible  ARMA  model  is  chosen  (e.g.  a  large  number  of 
parameters),  then  estimation  using  criteria  such  as  exact  likelihood  can  be 
expected  to  make  the  model  fit  the  low  spectrum  in  the  depleted  band,  leading 
to  unrealistically  high  significance  levels  for  the  discrete  components.  What 
is  required,  is  an  ARMA  model  of  limited  flexibility,  yet  sufficiently 
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plausable  as  a  representation  of  any  true  underlying  continuous  spectrum. 

This  model  should  then  be  fitted  only  to  the  unaffected  part  of  the  spectrum, 
and  its  inter/extrapolation  over  the  depleted  range  used  to  assess 
significance.  This  is  precisely  what  Marginal  likelihood  achieves  -  it  uses 
just  that  information  which  cannot  be  affected  by  estimation  of  the  discrete 
components.  The  main  danger  is  that  there  is  insufficient  information  remain¬ 
ing  to  adequately  determine  some  aspects  (parameters  or  parameter 
combinations)  of  the  ARMA  model. 

We  again  turn  to  the  'daylength'  data  in  which  discrete  (sinusoidal)  com¬ 
ponents  have  been  found  -  see  Luo  et  al,  by  searching  for  the  highest  peaks  in 


the  sample  spectrum.  One  ought ,  because  of  this,  to  be  much  more  severe  in 
conceding  significance  of  any  coefficient,  but  as  we  shall  see  later  this  may 


not  be  a  major  factor  in  the  inference  problem  in  this  case.  We  shall 
therefore  treat  the  frequencies  as  fixed,  and  the  corresponding  periods  are 
given  in  Table  7  with  those  of  the  fundamental  and  first  seven  harmonics  of 
the  data  for  comparison,  in  fact  a  trend  plus  twelve  frequencies  were  used  by 
Luo  et  al,  but  eight  frequencies  are  quite  sufficient  to  illustrate  the  points 
here.  Although  it  is  tempting,  as  a  result  of  our  previous  model  fitting  to 
this  data  in  section  6,  to  work  with  the  differenced  data,  this  would 
completely  prejudice  the  outcome  of  the  analysis  by  diminishing  the  large 
variance  at  low  frequencies  which  is  the  focus  of  interest.  We  therefore 
retain  the  ARMA(2,2)  model  as  the  model  for  the  error  structure  in  a 


regression  upon  discrete  components  cos  u^t  and  sin  o^t  for 


1... 11*150.  Note  that  if  the  AR  operator  in  the  model  has  a  root  close  to 


unity  then  we  would  expect  much  lower  levels  of  significance  for  the  fitted 
components.  A  trend  term  was  also  included  to  match  the  analysis  of  Luo  et 
al.  A  sequence  of  estimations  was  then  carried  out,  pairs  of  cosine  and  sine 


components  with  coefficients  c^ ,  being  introduced  for  i  *  1,2,3...  , 

corresponding  to  the  modeling  periods  in  Table  7.  We  select  the  results  of 
estimations  with  4  such  pairs,  and  with  8  such  pairs,  repeated  for  exact  and 
marginal  likelihood  estimation.  In  the  case  of  marginal  likelihood  estimation 
with  8  pairs,  there  were  found  to  be  two  distinct  local  minima  in  the  ARMA 
model  parameter  space,  and  the  results  for  all  these  estimation  points  are 
summarised  in  Table  8.  A  "t"  value  is  given  besides  each  regression 
coefficient  estimate,  being  the  ratio  to  its  estimated  standard  error. 


Component  #  12  3  45  6  7  8 

Modeling  period  178.698  89.348  59.555  45.0  34.503  29.783  22.337  19.885 

Harmonic  period  150  75 


50 


37.5  30 


25 


21.43 


18.75 


4  component 
(Dev) 


4  component 
(Mev) 


8  component 
(Dev) 


8  component 
(Mev) 


8  component 
(Mev)B 


Dev/Mev 

30747.36 

29027.36 

21217.48 

24372.40 

24154.68 

RMS 

219.206 

249.293 

162.189 

246.050 

183.607 

RDF 

136 

136 

128 

128 

128 

+](SE) 

1.239( . 107 ) 

1.374 (.111) 

.990 ( . 110 ) 

1.397 ( . 122 ) 

1.077 ( . 120 ) 

♦2(SE) 

-.416( . 105) 

-. 383{ .117) 

-.499( . 101 ) 

-.397( . 142) 

-.  314( . 126) 

e. (sb) 

-.508( .094) 

-•557( .091 ) 

-.437{ .099) 

-.548( .095) 

-.557( . 102) 

e2(sE) 

-.615( .082) 

-.625( .079) 

—.612 ( .085) 

-.625( .079) 

-.628( .081 ) 

meant t) 

-690.4(2.44) 

-778.2(1.54) 

-1617.0(4.53) 

-1195.5(0.59) 

-1351.5(2.56) 

trend ( t) 

9.88(2.65) 

10.99(1.86) 

21.75(4.60) 

16.30(1.48) 

18.30(2.62) 

cl  ( t) 

156.4(1.57) 

181.00(0.87) 

408.81(3.33) 

284.49(0.90) 

328.86(1.80) 

sl(t) 

384.6(2.11) 

440.87(1.45) 

988.54(4.31) 

716.12(1.35) 

816.98(2.40) 

c2(t> 

223.0(2.82) 

253.52(1.82) 

443.65(4.59) 

342.09(1.47) 

378.99(2.61) 

•2(t) 

123.7(2.15) 

142.50(1.24) 

339.91(5.07) 

245.37(1.42) 

279.40(2.76) 

c3(t) 

-15.7(0.29) 

-.853(0.01) 

138.93(2.17) 

67.41(0.42) 

94.29(0.96) 

a3(  t) 

55.4(2.04) 

60.71(0.97) 

122.39(7.12) 

100.42(1.39) 

109.97(3.62) 

c4(t) 

52.8(1.64) 

61.95(1.04) 

145.56(4.65) 

106.45(1.20) 

122.11(2.42) 

«4(t) 

-53.8(2.04) 

54.73(1.06) 

-54.94(2.93) 

-47.37(0.74) 

-49.43(1.53) 

c5(t) 

58.35(3.06) 

35.56(0.59) 

45.24(1.36) 

85  (t> 

-12.55(0.99) 

-12.21(0.26) 

11.67(0.49) 

c6(t) 

71.46(6.80) 

61.79(1.50) 

66.83(3.25) 

a6(t) 

-4.32(0.31) 

6.38(0.12) 

1.63(0.06) 

c7(t) 

9.39(1.34) 

7.29(0.30) 

8. 19(0.66) 

87  (t) 

-22.19(3.15) 

-22.71(0.93) 

-23.47 (-1.68) 

c8(t) 

17.92(2.48) 

14.71(0.73) 

15.73(1.20) 

s8(t) 

43.20(6.42) 

41.73(2.06) 

41.68(3.29) 

Mote  that  using  exact  likelihood,  in  columns  2  and  4,  the  root  of  the  AR 
operator  rapidly  moves  away  from  unity.  There  are  also  some  extremely  large 
t  values,  although  the  estimates  are  not  particularly  stable  due  to  the 
presence  of  the  trend  term  with  which  they  are  correlated.  Moving  to  the 
marginal  likelihood  estimates,  the  AR  parameters  remain  close  to  those  for  the 
univariate  model,  in  columns  3  and  S,  and  the  t  values  are  small.  In  the 
last  column  however  the  picture  is  quite  different,  with  the  roots  of  f(B) 
now  away  from  unity  and  moderate  t  values.  Mote  however  that  the  Mev 
values  in  columns  5  and  6,  which  are  for  separate  local  minima  for  the  same 
model,  have  Mev  values  within  one  percent  of  each  other.  The  marginal  like¬ 
lihood  is  designed  for  inference  about  the  ARMA  parameters,  so  we  investigate 
further  the  structure  of  the  Mev  surface  as  a  function  of  the  two  most 
variable  parameters  ^  and  4^.  The  values  of  0^  and  02  which  change 
very  little  for  all  the  models  are  kept  fixed  at  (-.537,-.615). 
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Figure  8:  Contours  of  marginal  likelihood  of  AR  parameters 

in  model  for  daylength  variations  with  four  frequency 
components. 


Figures  8  and  9  show  contour  plots  of  this  surface  with  $  ^  on  the 
vertical  axis  and  on  the  horizontal  axis,  so  the  right  hand  edge 

represents  the  boundary  of  the  parameter  space  corresponding  to  a  root  of 
♦  (B)  at  B  ■  1.  Figure  8  is  the  map  of  Mev  for  4  pairs  of  discrete 
components  in  the  model.  The  minimum  is  on  the  boundary  with  no  hint  of  any 
second  local  minimum.  The  classical  confidence  region  would  be  given  by  the 
second  heavy  contour  above  the  minimum,  except  that  working  close  to  the 
boundary  the  regularity  conditions  for  properties  of  likelihood  estimates  will 
not  hold.  From  a  Bayesian  point  of  view  we  do  however  have  a  reasonable 
approximation  to  a  truncated  bivariate  Normal  for  which  such  a  contour  has  an 
acceptable  interpretation. 
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Figure  9: 


Contours  of  marginal  likelihood  of  AR  parameters,  in 
model  for  daylength  variations  with  eight  frequency 
components. 


Figure  9  is  the  map  of  Mev  for  8  pairs  of  discrete  components,  and  we 
see  a  quite  different  picture.  Besides  the  obvious  local  minimum  there  is  one 
on  the  boundary  at  ^  ~  -.35  and  -  1.  The  region  between  these  is 

however  very  flat,  the  contours  have  been  taken  at  fine  intervals  of  Mev  to 
reveal  the  structure.  In  this  case  the  continuous  contours  cover  a  region 
where  Mev  is  within  one  percent  of  its  minimum,  the  broken  contours  being 
outside  this.  One  interpretation  might  be  to  say  that  the  scientist  has  to 
provide  convincing  evidence  that  his  data  cannot  be  easily  explained  by  some 
random  mechanism  -  in  this  case  a  random  walk.  Then  there  is  very  little 
evidence  here  to  persuade  one  to  forego  the  assumption  that  the  parameters  lie 
on  the  boundary  minimum.  The  conclusion  regarding  the  discrete  cos^po’.ents 
then  comes  from  column  5  of  Table  8  ~  none  are  convincingly  significant.  This 
may  be  viewed  as  being  rather  tough  on  the  scientist  who  might  claim  that  the 
approach  used  here  could  reduce  any  group  of  discrete  spectrum  components  to 
insignificance,  by  choice  of  a  suitable  ABMA  error  structure.  There  is  not 
much  truth  in  this.  Our  contour  maps  are  invariant  to  the  magnitude  of  the 
discrete  components  in  the  model,  so  suitably  large  magnitudes  could  have  lead 
to  significant  coefficients  in  column  5  of  Table  8.  It  is  possible  for  a 
series  truly  to  consist  of  discrete  components  in  such  a  mixture  that  they  add 
up  to  look  just  like  a  random  walk,  but  that  would  be  just  very  bad  luck  for 
the  scientist. 

It  is  disconcerting,  but  not  totally  unexpected,  that  the  exact  likeli¬ 
hood  ratio  is  so  misleading  in  this  case.  Comparing  the  Dev  value  of 
37990.88  for  the  model  1  in  Table  5  with  the  Dev  value  of  21217.48  for 
the  8  component  model  in  column  4  of  Table  8,  a  mean  deviance  ratio  of  101.19 
would  be  referred  to  chi-square  on  17  d.f,  and  taken  as  strong  evidence  of 
significant  discrete  components.  To  check  that  this  was  not  due  simply  to  the 


••lection  of  the  frequencies  by  inspection  of  the  sample  spectrum  peaks,  the 
data  was  refitted  using  the  harmonic  frequencies  corresponding  to  the  periods 
in  Table  7.  An  even  smaller  Dev  value  of  19691.95  resulted  with  very 
similar  AHMA  parameter  values.  As  a  further  check  a  series  of  length  150  was 
simulated  following  the  model  2  as  shown  in  Table  4,  i.e.  with  a  random  walk 
component.  This  is  shown  in  Figure  10  to  be  broadly  similar  to  the  day length 
series.  After  fitting  the  8  component  model  by  exact  likelihood,  a  picture 
very  close  to  that  in  column  4  of  Table  8  emerged,  with  t  values  over  10.0. 
The  harmonic  frequencies  were  used  here,  and  a  mean  deviance  ratio  of  81.8 
resulted.  Using  Marginal  likelihood,  the  Mev  surface  was  again  very  flat 
over  a  neighbourhood  of  the  boundary  *  1,  with  one  local  minimum  on 

the  boundary,  slightly  higher  than  a  second  minimum  a  short  way  off  the 
boundary. 


TIME 

Figure  10:  Simulated  series  from  model  2  in  table  5. 


The  grouping  of  a  large  number  of  discrete  components  at  the  end  of  the 
frequency  range  Is  In  large  part  the  cause  of  the  difficulty.  If  for  example 
the  discrete  components  6,  7  and  8  alone  are  fitted  to  the  simulated  data 
using  exact  likelihood,  reasonable  results  are  found,  with  a  mean  deviance 
ratio  of  8.17  being  referred  to  chi-square  on  7  d.f. 

A  Bayesian  approach  may  be  applied  In  this  case  as  for  the  regression 
parameter  In  the  peak  sunspot  series  of  section  7.  It  Is  clear  then  that  the 
high  't'  values  will  be  lost,  and  the  low  *  t*  values  will  tend  to 
dominate.  Taking  a  crude  50:50  mix  at  the  two  local  minima  In  Figure  9  to 
approximate  the  posterior  density  of  »$2),  w  see  for  example  that  for  a 

A  A 

typical  regression  coefficient  with  a  >  0,  p(a<0|y)  >  1/2  P(a<O|y,0  *  P ^ ) 

A 

where  0 ^  is  the  boundary  point  estimate  for  which  the  t  value  is  small  and 
the  probability  moderately  large. 

10.  CONCLUSION 

We  have  sought  to  demonstrate  that  marginal  likelihood  is  a  useful  tool 
for  dealing  with  a  very  practical  problem  -  the  estimation  of  the  error 
structure  in  regression  models.  For  those  not  satisfied  with  the  Bayesian 
interpretation,  there  remains  of  course  the  question  of  the  sampling 
properties  of  the  estimates  obtained  using  this  criterion.  There  are  other 
effects  of  using  marginal  likelihood,  in  particular  the  residual  auto¬ 
correlations  may  not  (and  perhaps  should  not)  look  unduly  ’white' ,  since  the 
estimation  does  not  attempt  to  repair  any  distortion  of  the  spectrum  due  to 
the  regression. 

In  only  one  example  was  a  non-deterministic  regressor  used  (the  peak  sun¬ 
spot  numbers).  This  is  because  in  time  series  applications,  distributed  lag 
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or  transfer  function  Models  are  often  appropriate  when  true  observed 
regressors  are  used.  For  such  models  with  non-linear  regression  parameters, 
it  is  not  immediately  clear  how  to  extend  the  marginal  likelihood.  The 
Bayesian  approach  is  always  possible  provided  suitable  priors  can  be  agreed. 
Local  linearisation  might  be  a  good  initial  step,  and  leads  to  formulae  such 
as  proposed  by  Leonard  (1982)  to  approximate  the  results  of  integrating  out 
nuisance  parameters. 

The  software  for  marginal  likelihood  estimation  used  for  the  examples  in 
this  paper  is  available  in  the  Genstat  package  and  subroutine  library 
distributed  by  the  Numerical  Algorithms  Group,  Oxford. 

Although  the  use  of  marginal  likelihood  or  its  Bayesian  equivalent  has 
been  advocated  for  many  years,  in  the  papers  of  Zellner  and  Tiao,  Levenbach, 
and  Cooper  and  Thompson,  it  has  been  somewhat  neglected.  It  is  hoped  that 
this  paper  will  help  to  stimulate  more  interest. 
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